class: center, middle, inverse, title-slide

# Cleaning, Summarizing, and Visualizing Data
### Jeroen Mahieu
### jeroen.mahieu@vu.nl
### Updated 2022-02-09

---

# The Data Science Pipeline

* Quantitative research is about numeric `data`

<img src="data:image/png;base64,#C:/Users/jeroe/Dropbox/VU teaching/R-workshop/_site/images/data-science-pipeline.png" width="400px" style="display: block; margin: auto;" />

---

# Cleaning (Tidying) Data

* According to a [2014 NYTimes article](https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html), "data scientists [...] spend from ***50 percent to 80 percent of their time*** mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."
* Luckily, we have some powerful tools to help us out.
* Here, we will focus on [`dplyr`](https://dplyr.tidyverse.org), which is part of the [`tidyverse`](https://www.tidyverse.org).
* When you work with large datasets (100k+ rows with many columns), learn to use [`data.table`](https://github.com/Rdatatable/data.table/wiki), which is much faster but has a more difficult syntax.

---

# `dplyr` Overview

* You are ***highly encouraged*** to read through [Hadley Wickham's chapter](https://r4ds.had.co.nz/transform.html). It's clear and concise.

--

* Also check out this great ["cheatsheet"](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf).

--

* The package is organized around a set of **verbs**, i.e. *actions* to be taken.
* We operate on `data.frames` or `tibbles` (*nicer looking* data.frames).
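
To make the idea of **verbs** concrete, here is a minimal sketch, assuming the `dplyr` package is installed. The `students` tibble and its column names are made up for illustration and are not part of the workshop data.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
students <- tibble(
  name  = c("Ann", "Bob", "Cas", "Dee"),
  group = c("A", "A", "B", "B"),
  score = c(7.5, 6.0, 8.0, 5.0)
)

# Each verb takes a data frame (or tibble) and returns a new one
passed <- students %>%
  filter(score >= 6) %>%   # keep rows that satisfy a condition
  select(name, score) %>%  # keep only these columns
  arrange(desc(score))     # sort rows, highest score first

# group_by() + summarise() collapse each group to a single row
group_means <- students %>%
  group_by(group) %>%
  summarise(mean_score = mean(score))
```

Because every verb returns a new data frame, the steps chain naturally with the pipe (`%>%`), and the original `students` object is left unchanged.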